PRIMARY FILTERING
Before any markers can be categorized and investigated further, it is important to determine which clusters identified by Rmixmod are biologically plausible and which are more likely to be 'noise'. My working theory is that this 'noise' arises from many factors such as poor-quality DNA, contamination, and technician/machine error. While the term 'noise' is somewhat vague, many of these clusters are readily identifiable in the plot outputs. The assumption I made when choosing this clustering method is that some 'noise' effects are relatively small (stochastic pipette error, machine error, batch variance) and some are large (bad DNA, technician mistakes), but all are essentially random. The SNP effect, by contrast, is less random and is instead driven by biochemistry. There is also a polyploid effect, where homeologs and homologs can alter the assay signal. Therefore, the final polar theta-r coordinate at which a sample falls is a combination of the SNP effect, the polyploid effect, small noise, and big noise.
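This additive composition can be sketched with a toy simulation (illustrative only, not the actual pipeline; all effect sizes and noise magnitudes here are made-up assumptions):

```r
# Toy simulation of the assumed signal model: theta is an additive SNP effect
# plus a polyploid shift, plus frequent small noise and rare big noise.
set.seed(1)
n <- 300
genotype <- sample(0:2, n, replace = TRUE)                 # AA, AB, BB
snp_effect <- genotype / 2                                 # additive effect on theta
polyploid_shift <- 0.05                                    # assumed homeolog offset
small_noise <- rnorm(n, 0, 0.02)                           # pipette/machine/batch
big_noise <- ifelse(runif(n) < 0.03, rnorm(n, 0, 0.3), 0)  # bad DNA, gross errors
theta <- pmin(pmax(snp_effect + polyploid_shift + small_noise + big_noise, 0), 1)
r <- 1 + rnorm(n, 0, 0.1)                                  # signal intensity
```

Under this model, most samples land in tight genotype clusters while the rare big-noise draws scatter as outliers, which is the pattern the filtering below tries to separate.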
My strategy for determining whether an identified cluster is a distribution based on biochemistry or noise is to use a machine learning algorithm to automatically determine whether to keep or remove a cluster. First, I will create a training dataset where I have manually classified all clusters in a relatively small but representative subset of markers. Then I will use this training dataset to teach a machine learning algorithm the best criteria to use to divide the clusters. After these criteria are determined, this model can then be used to classify the rest of the clusters.
Example Marker Cluster Plots
## Warning in chol.default(shape, pivot = TRUE): the matrix is either rank-
## deficient or indefinite
Note that this marker clearly has three major clusters and two ‘noisy’ clusters that look like outliers and not part of the major clusters.
This marker demonstrates how difficult this process is. There appears to be a 3-cluster SNP compressed on the right-hand side of the plot, along with two obvious 'noise' clusters. However, there is also an outlier cluster centered around (0.5, 2) that may actually be biologically real. To test this theory, I extracted which samples were contained in this cluster:
## theta r cluster
## Bau/Far-b1-1.Theta 0.4219012 2.132922 6
## Bau/Far-b1-34.Theta 0.4515495 2.248773 6
## Bau/Far-b1-36.Theta 0.5638779 2.039139 6
## Bau/Far-b1-39.Theta 0.5059506 1.876826 6
## Bau/Far-b1-40.Theta 0.5714495 1.949252 6
## Bau/Far-b1-41.Theta 0.4110552 2.166358 6
## Bau/Far-b1-46.Theta 0.4516166 2.320468 6
## Bau/Far-b1-47.Theta 0.4453222 2.167106 6
## Bau/Far-b1-48.Theta 0.4647934 2.102957 6
## Bau/Far-b1-49.Theta 0.5487349 1.878951 6
## Bau/Far-b1-50.Theta 0.5338429 2.040202 6
## Bau/Far-b1-51.Theta 0.4563138 2.185028 6
## Bau/Far-b1-53.Theta 0.4787483 2.071924 6
## Bau/Far-b1-55.Theta 0.5247312 1.933218 6
## Bau/Far-b1-56.Theta 0.4369121 2.180591 6
## Bau/Far-b1-58.Theta 0.5487795 1.974176 6
## Bau/Far-b2-129.Theta 0.5316229 1.936603 6
## Bau/Far-b2-131.Theta 0.4649383 2.177950 6
## Bau/Far-b2-132.Theta 0.5181820 1.998634 6
## Bau/Far-b2-133.Theta 0.5451783 1.926294 6
## Bau/Far-b2-134.Theta 0.5410533 2.097090 6
## Bau/Far-b2-137.Theta 0.5560583 1.914551 6
## Bau/Far-b2-138.Theta 0.4261789 1.964600 6
## Bau/Far-b2-149.Theta 0.4757960 2.112176 6
## Bau/Far-b2-150.Theta 0.5169359 1.899830 6
## Bau/Far-b2-153.Theta 0.5155960 1.963779 6
## Bau/Far-b2-155.Theta 0.5452015 1.942958 6
## Bau/Far-b2-187.Theta 0.4375570 2.133880 6
## AHR-Bauermeister.Theta 0.5293083 2.002929 6
## Other-Nor/Phi-b2-9-A2.Theta 0.5503911 2.093022 6
## Other-Caleb-Tucson E03.Theta 0.5246291 2.021065 6
## Other-Caleb-Tucson C06.Theta 0.5489718 2.041049 6
## QAM2-J960678-5.Theta 0.5741200 1.813585 6
Creating some training data
Before I can do any machine learning, I need a set of training data. For this, I want to sample a subset of markers that represents the spectrum of cluster variation present in my entire set of markers. To do this, I sub-sampled markers stratified by their pre-filter cluster count: 100 markers with 2 clusters, 100 with 3, 100 with 4, and 100 with 5+. I plotted each marker and manually classified each cluster as biological or noise, for a total of 400 markers (~1400 clusters). I also calculated several characteristics that I think will help:

* prop = the proportion of samples in the cluster
* xcoord = the x-coordinate center
* ycoord = the y-coordinate center
* var1 = the 1st variance component (calculated from the rotational covariance matrix)
* var2 = the 2nd variance component
* xvar = the theta-associated variance
* yvar = the R-associated variance
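As a sketch of how var1/var2 relate to xvar/yvar (this is my reading of the 'rotational covariance matrix' components; the helper name is illustrative and not the code inside extract.clust.stats): var1 and var2 would be the eigenvalues of a cluster's 2x2 covariance matrix, i.e. its variances along the rotated principal axes, while xvar and yvar are the raw diagonal entries.

```r
# Illustrative helper: principal-axis and axis-aligned variances of one
# cluster's theta/r coordinates. The eigenvalues are rotation-invariant;
# the covariance diagonal gives the theta- and R-associated variances.
cluster.variances <- function(theta, r) {
  covmat <- cov(cbind(theta, r))
  eig <- eigen(covmat, symmetric = TRUE)
  list(var1 = eig$values[1],   # variance along the major axis
       var2 = eig$values[2],   # variance along the minor axis
       xvar = covmat[1, 1],    # theta-associated variance
       yvar = covmat[2, 2])    # R-associated variance
}
```

Since rotation preserves the trace of the covariance matrix, var1 + var2 always equals xvar + yvar; a cluster with var2 much smaller than var1 is a thin, elongated ellipse.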
# Extracting the number of raw, unfiltered clusters identified for each marker
clustnum <- data.frame(matrix(nrow = length(mixmodout), ncol = 2))
for (i in 1:length(mixmodout)) {
clustnum[i,1] <- mixmodout[[i]][[2]]
clustnum[i,2] <- mixmodout[[i]][[1]][1]
}
# Indexing marker category
clust_2 <- which(clustnum[,2] == 2)
clust_3 <- which(clustnum[,2] == 3)
clust_4 <- which(clustnum[,2] == 4)
clust_5 <- which(clustnum[,2] >= 5)
# Creating the sample dataset
set.seed(3)
clust_2_samp <- sample(clust_2, 100)
set.seed(3)
clust_3_samp <- sample(clust_3, 100)
set.seed(3)
clust_4_samp <- sample(clust_4, 100)
set.seed(3)
clust_5_samp <- sample(clust_5, 100)
training_index <- c(clust_2_samp, clust_3_samp, clust_4_samp, clust_5_samp)
# Extracting cluster information
training_data <- extract.clust.stats(training_index, theta_frame, r_frame, mixmodout)
# Plotting the first 100 markers
for (n in clust_2_samp) {
marker_name <- mixmodout[[n]][[2]]
file_name <- paste(which(clust_2_samp == n), '_', marker_name, '.png', sep = "")
clustplot <- plot.raw.clust(n, theta_frame, r_frame, mixmodout)
ggsave(file_name, plot = clustplot, device = 'png')
}
# Plotting markers 201:300
for (n in clust_3_samp) {
marker_name <- mixmodout[[n]][[2]]
file_name <- paste(which(clust_3_samp == n), '_', marker_name, '.png', sep = "")
clustplot <- plot.raw.clust(n, theta_frame, r_frame, mixmodout)
ggsave(file_name, plot = clustplot, device = 'png')
}
# Plotting markers 301:400
for (n in clust_4_samp) {
marker_name <- mixmodout[[n]][[2]]
file_name <- paste(which(clust_4_samp == n), '_', marker_name, '.png', sep = "")
clustplot <- plot.raw.clust(n, theta_frame, r_frame, mixmodout)
ggsave(file_name, plot = clustplot, device = 'png')
}
# Plotting markers 401:500
for (n in clust_5_samp) {
marker_name <- mixmodout[[n]][[2]]
file_name <- paste(which(clust_5_samp == n), '_', marker_name, '.png', sep = "")
clustplot <- plot.raw.clust(n, theta_frame, r_frame, mixmodout)
ggsave(file_name, plot = clustplot, device = 'png')
}
# These plots were then used to score whether I thought each cluster was noise or real
Here’s what the training set looks like:
# Reading in the training data
data.raw <- read.table('/home/mmcgowan/Dropbox/M.McGowanSharedFiles/Wheat_data/genotypeclustering/kamiak_clust/plot_examples/training_data.csv', header = T, sep = '\t', stringsAsFactors = F)
data.raw[1:10,]
## category marker cluster prop xcoord
## 1 1 D_contig22790_269 1 0.97991071 0.0916442
## 2 0 D_contig22790_269 2 0.02008929 0.2377851
## 3 1 RFL_Contig1416_3921 1 0.97842262 0.3663278
## 4 0 RFL_Contig1416_3921 2 0.02157738 0.3733695
## 5 1 Excalibur_rep_c71163_225 1 0.98511905 0.0179073
## 6 0 Excalibur_rep_c71163_225 2 0.01488095 0.1732606
## 7 1 Excalibur_c4834_2003 1 0.02008929 0.3574809
## 8 0 Excalibur_c4834_2003 2 0.97991071 0.5045603
## 9 1 Kukri_c80964_72 1 0.98288690 0.5906906
## 10 0 Kukri_c80964_72 2 0.01711310 0.4227292
## ycoord var1 var2 xvar yvar
## 1 0.4664488 0.002187762 0.0003615288 3.836061e-04 0.002167622
## 2 0.2203040 0.045530162 0.0142274518 4.495702e-02 0.017098959
## 3 1.2846571 0.004393299 0.0003716721 3.719550e-04 0.004396643
## 4 0.7909119 0.271798304 0.0179255230 1.856572e-02 0.281505386
## 5 0.9159507 0.005038310 0.0000219567 2.197329e-05 0.005042118
## 6 0.4903193 0.184138557 0.0208900017 2.198948e-02 0.193830060
## 7 0.3099243 0.090752681 0.0222673287 6.052143e-02 0.056845508
## 8 0.5040303 0.003222514 0.0011285464 1.310501e-03 0.003043866
## 9 2.0697504 0.009369652 0.0006033352 6.242720e-04 0.009356271
## 10 1.1455948 0.956871809 0.0331559054 7.892750e-02 0.956101473
Training a model
Now I want to try using a boosted tree method to determine a classification strategy. I found this online guide useful: Gradient boosting model guide
# Creating a formatted data.frame for analysis using gbm
data <- data.frame(data.raw$marker, data.raw$cluster, data.raw$prop, data.raw$xcoord, data.raw$ycoord, data.raw$var1, data.raw$var2, data.raw$category)
names(data) <- c('marker', 'cluster', 'prop', 'xcoord', 'ycoord', 'var1', 'var2', 'category')
# Converting the classifier into a factor
data$category <- ifelse(data$category==1,'real','noise')
data$category <- as.factor(data$category)
# Splitting the data into a training and test set
set.seed(23)
test_idx <- sample(1:nrow(data), round(0.1 * nrow(data)))
data.train <- data[-test_idx,]
data.test <- data[test_idx,]
# Configuring caret to control resampling (10-fold cross-validation on the training data)
objControl <- trainControl(method='cv', number=10, returnResamp='none', summaryFunction = twoClassSummary, classProbs = TRUE)
# Setting outcome and predictors
outcome_name <- 'category'
predictor_names <- c('prop', 'xcoord', 'ycoord', 'var1', 'var2')
# Running the model
model_gbm <- train(data.train[,predictor_names], data.train[,outcome_name],
method = 'gbm',
trControl=objControl,
metric = 'ROC',
preProc = c('center', 'scale')
)
## Loading required package: plyr
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.2273 -nan 0.1000 0.0800
## 2 1.0947 -nan 0.1000 0.0653
## 3 0.9854 -nan 0.1000 0.0542
## 4 0.8906 -nan 0.1000 0.0458
## 5 0.8085 -nan 0.1000 0.0385
## 6 0.7413 -nan 0.1000 0.0325
## 7 0.6844 -nan 0.1000 0.0284
## 8 0.6336 -nan 0.1000 0.0250
## 9 0.5907 -nan 0.1000 0.0214
## 10 0.5531 -nan 0.1000 0.0172
## 20 0.3433 -nan 0.1000 0.0036
## 40 0.2362 -nan 0.1000 0.0001
## 50 0.2202 -nan 0.1000 -0.0002
# Reading in the boosted model
model_gbm <- readRDS('/home/mmcgowan/Dropbox/M.McGowanSharedFiles/Wheat_data/genotypeclustering/kamiak_clust/model_gbm.RDS')
Evaluating the GBM model
Now that I have a model, I will investigate what the properties of this model are.
The first thing I will look at is what variables were most important for determination:
summary(model_gbm)
## var rel.inf
## prop prop 91.7051565
## var2 var2 6.5368428
## ycoord ycoord 0.7219190
## var1 var1 0.6841521
## xcoord xcoord 0.3519296
It looks like the primary contributor is the proportion of samples in the cluster, which makes a lot of sense! Then the next contributor is the var2 component. The other components don’t seem to be helping much in classifying the clusters.
Now I will test the model's accuracy using the testing data I set aside.
predictions <- predict(object=model_gbm, data.test[,predictor_names], type = 'raw')
head(predictions)
## [1] noise noise noise noise real noise
## Levels: noise real
print(postResample(pred = predictions, obs = as.factor(data.test[,outcome_name])))
## Accuracy Kappa
## 0.9677419 0.9354436
That looks pretty good, 97% of the testing clusters I set aside were classified correctly!
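Overall accuracy can hide class-specific behavior when 'real' clusters greatly outnumber 'noise' ones, so per-class rates are worth checking too. Here is a toy illustration of computing them from a confusion table (the vectors are made up; with caret loaded, confusionMatrix(predictions, data.test$category) reports the same breakdown for the real test set):

```r
# Toy vectors for illustration; in the real analysis one would tabulate
# `predictions` against `data.test$category` instead.
obs  <- factor(c('real', 'real', 'noise', 'noise', 'real', 'noise'),
               levels = c('noise', 'real'))
pred <- factor(c('real', 'noise', 'noise', 'noise', 'real', 'real'),
               levels = c('noise', 'real'))
tab <- table(pred = pred, obs = obs)
noise_recall <- tab['noise', 'noise'] / sum(tab[, 'noise'])  # noise correctly flagged
real_recall  <- tab['real', 'real']   / sum(tab[, 'real'])   # real clusters kept
```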
Let's look at which clusters were wrongly classified.
which(data.test$category != predictions)
## [1] 10 12 77 120 140
data.test[which(data.test$category != predictions),]
## marker cluster prop xcoord ycoord
## 1536 BS00076596_51 1 0.06994048 0.65176497 0.7741730
## 1080 wsnp_Ex_c19371_28311667 4 0.05282738 0.58338765 0.7009184
## 1188 GENE-2906_249 2 0.03497024 1.00000000 0.7255342
## 1530 Excalibur_c17206_256 1 0.02678571 0.71990417 1.2360362
## 379 Excalibur_c115824_342 2 0.06622024 0.03714632 0.5056953
## var1 var2 category
## 1536 0.040573725 1.810731e-02 real
## 1080 0.044422275 4.332317e-03 noise
## 1188 0.002180045 3.710452e-10 noise
## 1530 0.004698476 7.934533e-04 real
## 379 0.022747278 9.359500e-05 noise
x <- which(theta_frame$Name == 'BS00076596_51')
plot.raw.clust(x, theta_frame, r_frame, mixmodout)
x <- which(theta_frame$Name == 'wsnp_Ex_c19371_28311667')
plot.raw.clust(x, theta_frame, r_frame, mixmodout)
x <- which(theta_frame$Name == 'Excalibur_c115824_342')
plot.raw.clust(x, theta_frame, r_frame, mixmodout)
Looking at these plots, it makes a lot of sense why the center 'real' clusters in the first two were misclassified as 'noise'. Despite there being a clear distribution in the middle, it is very debatable whether they should be considered 'real'.
The third plot example is more confusing. There seems to be a cluster nested in the primary one that has a very tight distribution, but also contains a fair proportion of samples (7%). This actually seems to be more of a problem with the Rmixmod results rather than the classification model. I think it may be possible to identify ‘real’ clusters that could be merged after filtering.
Using the classification model to filter clusters.
Now that I have a reasonably accurate model for predicting whether a cluster is 'real' or 'noise', I can apply it to every cluster.
# Extracting stats for all markers
all_clust_stats <- extract.clust.stats(indices = 1:length(mixmodout), theta_frame, r_frame, mixmodout)
# Reading in all_clust_stats (to improve Knit time)
all_clust_stats <- readRDS('/home/mmcgowan/Dropbox/M.McGowanSharedFiles/Wheat_data/genotypeclustering/kamiak_clust/all_clust_stats.RDS')
# Predicting whether a cluster is 'real' or 'noise'
all_clust_predict <- predict(object=model_gbm, all_clust_stats[,predictor_names], type = 'raw')
# Adding these predictions to the cluster stats frame, but coercing it into a logical
all_clust_stats$category <- all_clust_predict
all_clust_stats$cat_bin <- as.logical(as.numeric(all_clust_predict)-1)
# Calculating how many clusters remain for each marker after removing 'noise'
clustnum_filt <- vector(mode = 'integer', length = length(mixmodout))
for (i in 1:length(mixmodout)) {
cat('\r', 'Processing marker ', i, ' out of ', length(mixmodout), sep = '')
marker_name <- mixmodout[[i]][[2]]
real_clust <- which(all_clust_stats$marker == marker_name & all_clust_stats$cat_bin)
clustnum_filt[i] <- length(real_clust)
}
# Storing data in a viewable frame
clust_num_frame <- data.frame(matrix(NA, ncol = 3, nrow = nrow(clustnum)))
names(clust_num_frame) <- c('marker', 'prefilter', 'postfilter')
clust_num_frame[,1] <- theta_frame$Name
clust_num_frame[,2] <- clustnum[,2]
clust_num_frame[,3] <- clustnum_filt
Then I plot the distribution of post-filter cluster counts across markers and output a table.
qplot(clust_num_frame$postfilter, bins = 9) + ggtitle('Histogram of Post-Filter Cluster Count')
table(clust_num_frame$postfilter)
##
## 0 1 2 3 4 5 6 7 8
## 283 39340 17681 17453 4441 1639 571 167 12
Based on this table, roughly half of the 81k markers are monomorphic. Also I noticed that there were 283 markers that had zero clusters…weird. So I plotted a few of them:
zero_clust <- which(clust_num_frame$postfilter == 0)[1:10]
zero_clust
## [1] 180 387 1642 2139 2973 3992 4093 5906 6548 6984
for (i in zero_clust) {
print(plot.raw.clust(i, theta_frame, r_frame, mixmodout))
}
Hmm, for some reason, the centers of the clusters are all at (0,0) and their variances are very big. Why did the clustering centers fail in this case? I’m not sure, but I will table this for later….
Nevertheless, if a majority of the 2 and 3 cluster markers are correct, I will still have a lot of useful markers (35k markers). Let’s look at a subset of these markers.
two_clust <- which(clust_num_frame$postfilter == 2)[1:20]
for (i in two_clust) {
marker_name <- mixmodout[[i]][[2]]
file_name <- paste(marker_name, '.png', sep = "")
clustplot <- plot.raw.clust(i, theta_frame, r_frame, mixmodout)
ggsave(paste(file_name, '.png', sep = ""), plot = clustplot, device = 'png')
}
three_clust <- which(clust_num_frame$postfilter == 3)[1:20]
for (i in three_clust) {
marker_name <- mixmodout[[i]][[2]]
file_name <- paste(marker_name, '.png', sep = "")
clustplot <- plot.raw.clust(i, theta_frame, r_frame, mixmodout)
ggsave(paste(file_name, '.png', sep = ""), plot = clustplot, device = 'png')
}
two_clust <- which(clust_num_frame$postfilter == 2)[1:5]
for (i in two_clust) {
print(plot.raw.clust(i, theta_frame, r_frame, mixmodout))
}
## Warning in chol.default(shape, pivot = TRUE): the matrix is either rank-
## deficient or indefinite
three_clust <- which(clust_num_frame$postfilter == 3)[1:5]
for (i in three_clust) {
print(plot.raw.clust(i, theta_frame, r_frame, mixmodout))
}
## Warning in chol.default(shape, pivot = TRUE): the matrix is either rank-
## deficient or indefinite
Ok, almost all of these markers seem to make sense. However, 3-cluster markers fall into two different situations: in some the clusters are roughly equidistant, and in others they are not.
Filtering 3-cluster markers
Because I assume a single SNP has an additive effect on the theta readout, I want to apply a second filter that keeps only the 3-cluster markers that follow the equidistant assumption.
three_clust <- which(clust_num_frame$postfilter == 3)
three_clust_names <- as.character(theta_frame$Name[three_clust])
three_clust_stats <- all_clust_stats[all_clust_stats$marker %in% three_clust_names & all_clust_stats$cat_bin == T,]
three_clust_ratio <- vector(mode = 'numeric', length = length(three_clust_names))
three_clust_mindist <- vector(mode = 'numeric', length = length(three_clust_names))
for (i in 1:length(three_clust_names)) {
xcoord <- three_clust_stats$xcoord[three_clust_stats$marker == three_clust_names[i]]
distances <- sort(dist(xcoord))
ratio <- distances[1] / distances[2]
three_clust_mindist[i] <- min(distances)
three_clust_ratio[i] <- ratio
}
three_clust_dist_stats <- data.frame(three_clust_names, three_clust_mindist, three_clust_ratio, stringsAsFactors = F)
names(three_clust_dist_stats) <- c('names', 'mindist', 'ratio')
Plotting a histogram of the minimum cluster distance for 3 cluster markers
qplot(three_clust_dist_stats$mindist) + ggtitle('3 Cluster marker: Histogram of minimum x-distance')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Plotting a histogram of the ratio of the smallest to the second-smallest inter-cluster distance (for equidistant left-middle-right clusters this ratio is ~1)
qplot(three_clust_dist_stats$ratio) + ggtitle('3 Cluster marker: Histogram of left to right cluster distance ratio')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Clearly there is a strong bimodal distribution for the minimum distance. My guess is that the near-zero mode comes from markers where truncation at zero or one caused two clusters to be found very close together. A cutpoint of 0.025 - 0.05 should filter these out.
There is also a bimodal split in the ratio, between markers with roughly equidistant clusters (~1) and unequal ones (~0). A cutpoint between 0.25 and 0.5 should take care of these.
three_clust_filter2_list <- three_clust_dist_stats$names[three_clust_dist_stats$mindist > 0.025 & three_clust_dist_stats$ratio > 0.375]
Now let's look at a random set of 5.
set.seed(3)
sample_list <- sample(three_clust_filter2_list, 5)
sample_index <- which(theta_frame$Name %in% sample_list)
for (i in 1:length(sample_index)) {
index <- sample_index[i]
marker_name <- mixmodout[[index]][[2]]
file_name <- paste(i, '_', marker_name, '.png', sep = "")
clustplot <- plot.filter.clust(index, theta_frame, r_frame, mixmodout, all_clust_stats)
ggsave(file_name, plot = clustplot, device = 'png')
}
for (i in 1:length(sample_index)) {
index <- sample_index[i]
marker_name <- mixmodout[[index]][[2]]
clustplot <- plot.filter.clust(index, theta_frame, r_frame, mixmodout, all_clust_stats)
print(clustplot)
}
These all look great! Just to be sure, let's look at the most extreme ratio values right above my cutpoint.
three_clust_filter2 <- three_clust_dist_stats[three_clust_dist_stats$names %in% three_clust_filter2_list,]
extreme_test <- three_clust_filter2[order(three_clust_filter2$ratio),][1:5,1]
sample_index <- which(theta_frame$Name %in% extreme_test)
for (i in 1:length(sample_index)) {
index <- sample_index[i]
marker_name <- mixmodout[[index]][[2]]
clustplot <- plot.filter.clust(index, theta_frame, r_frame, mixmodout, all_clust_stats)
print(clustplot)
}
Even the extreme examples still make sense from a clustering perspective. Considering that these are the minority of my three cluster markers, I think my filters have worked in picking out three cluster markers that match eyeball expectations for a bi-allelic marker with three genotypes (AA, AB, BB).
Filtering 2-cluster markers
Now I want to try filtering 2-cluster markers. Like before, I calculated the minimum distance between clusters (since there are only two, there is only one distance). Then I plotted a histogram of these distances.
two_clust <- which(clust_num_frame$postfilter == 2)
two_clust_names <- as.character(theta_frame$Name[two_clust])
two_clust_stats <- all_clust_stats[all_clust_stats$marker %in% two_clust_names & all_clust_stats$cat_bin == T,]
two_clust_mindist <- vector(mode = 'numeric', length = length(two_clust_names))
for (i in 1:length(two_clust_names)) {
xcoord <- two_clust_stats$xcoord[two_clust_stats$marker == two_clust_names[i]]
distances <- sort(dist(xcoord))
two_clust_mindist[i] <- min(distances)
}
two_clust_dist_stats <- data.frame(two_clust_names, two_clust_mindist, stringsAsFactors = F)
names(two_clust_dist_stats) <- c('names', 'mindist')
qplot(two_clust_dist_stats$mindist) + ggtitle('Histogram of two-cluster marker x-coordinate center distance')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Weird, the distribution appears to have four modes, at roughly 0, 0.125, 0.35, and 0.9. The 0.9 mode makes the most sense for biallelic SNPs with no homeologs, since AA would sit at theta = 0 and BB at theta = 1. With homeolog interaction, however, 0.35 also makes some sense. Before making any more conjectures, I plotted a random sample from each mode.
cat1_names <- sample(two_clust_dist_stats$names[two_clust_dist_stats$mindist < 0.025], 5)
cat2_names <- sample(two_clust_dist_stats$names[two_clust_dist_stats$mindist > 0.05 & two_clust_dist_stats$mindist < 0.1], 5)
cat3_names <- sample(two_clust_dist_stats$names[two_clust_dist_stats$mindist > 0.3 & two_clust_dist_stats$mindist < 0.4], 5)
cat4_names <- sample(two_clust_dist_stats$names[two_clust_dist_stats$mindist > 0.8], 5)
cat1_index <- which(theta_frame$Name %in% cat1_names)
cat2_index <- which(theta_frame$Name %in% cat2_names)
cat3_index <- which(theta_frame$Name %in% cat3_names)
cat4_index <- which(theta_frame$Name %in% cat4_names)
5 Plots for markers with an x-distance < 0.025
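The plotting chunk for this first category does not appear in the rendered output (only its warning survived); presumably it mirrored the loops for the other categories, along these lines:

```r
# Reconstructed from the pattern of the other category loops (assumption)
for (i in 1:length(cat1_index)) {
  index <- cat1_index[i]
  marker_name <- mixmodout[[index]][[2]]
  clustplot <- plot.filter.clust(index, theta_frame, r_frame, mixmodout, all_clust_stats)
  print(clustplot)
}
```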
## Warning: Removed 26 rows containing missing values (geom_path).
5 Plots for markers with an x-distance > 0.05 & < 0.1
for (i in 1:length(cat2_index)) {
index <- cat2_index[i]
marker_name <- mixmodout[[index]][[2]]
clustplot <- plot.filter.clust(index, theta_frame, r_frame, mixmodout, all_clust_stats)
print(clustplot)
}
5 Plots for markers with an x-distance > 0.3 & < 0.4
for (i in 1:length(cat3_index)) {
index <- cat3_index[i]
marker_name <- mixmodout[[index]][[2]]
clustplot <- plot.filter.clust(index, theta_frame, r_frame, mixmodout, all_clust_stats)
print(clustplot)
}
5 Plots for markers with an x-distance > 0.8
for (i in 1:length(cat4_index)) {
index <- cat4_index[i]
marker_name <- mixmodout[[index]][[2]]
clustplot <- plot.filter.clust(index, theta_frame, r_frame, mixmodout, all_clust_stats)
print(clustplot)
}
Based on these samples, all categories seem ok except for the first. Some of them do seem to have a weak middle cluster that was flagged as noise, but I think the filtering is still acceptable.
Extracting filtered markers and calling genotypes
First, I need to extract a list of marker names that I want to keep. Then I subset the partition file to only include markers in the list.
two_cluster_filter_list <- two_clust_dist_stats$names[two_clust_dist_stats$mindist > 0.0375]
total_clust_filter_list <- c(two_cluster_filter_list, three_clust_filter2_list)
Then I call the extract.genotype function. This function checks whether the marker is a 2- or 3-cluster marker and assigns genotypes according to the theta-value order of the clusters: from left to right, a 2-cluster marker is coded (0, 2) and a 3-cluster marker (0, 1, 2).
genotype_matrix <- extract.genotype(total_clust_filter_list, theta_frame, r_frame, mixmodout, all_clust_stats)
row.names(genotype_matrix) <- names(theta_frame)[-1]
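A minimal sketch of the calling rule just described (illustrative only, not the actual extract.genotype implementation): rank the kept clusters by their theta centers, then map each sample's cluster assignment to 0/2 (two clusters) or 0/1/2 (three clusters).

```r
# Hypothetical helper mirroring the described rule: genotype codes are
# assigned by the left-to-right (theta) order of the cluster centers.
call.genotypes <- function(cluster_of_sample, cluster_theta_centers) {
  k <- length(cluster_theta_centers)
  codes <- if (k == 2) c(0, 2) else c(0, 1, 2)  # left-to-right coding
  rank_of <- rank(cluster_theta_centers)        # 1 = leftmost cluster
  codes[rank_of[cluster_of_sample]]
}
```

For example, with two clusters centered at theta = 0.9 and 0.1, samples in the first cluster are coded 2 (rightmost) and samples in the second are coded 0.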
Subsetting QAM samples, fixing sample names, and saving for later
To figure out which samples are QAM, I checked the row names for my genotype matrix. After determining which ones were QAM, I subset them.
Next I used an Excel file provided by Dr. Carter to create a name conversion table. I visually checked to confirm the original order matches my genotype frame and then replaced the genotype matrix row names with the corrected sample names from the conversion table. Then I saved the R object as an RDS file for further analysis.
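The visual order check can also be done programmatically. A toy demonstration with made-up names (with the real objects, one would compare the original-name column of sample_guide, whose column name 'Original' is an assumption here, against row.names(QAM_genotypes)):

```r
# Toy demonstration of verifying sample order before replacing names.
geno <- matrix(0, nrow = 3, ncol = 2,
               dimnames = list(c('QAM2-1', 'QAM2-2', 'QAM2-3'), NULL))
guide <- data.frame(Original  = c('QAM2-1', 'QAM2-2', 'QAM2-3'),
                    Corrected = c('QAM2-001', 'QAM2-002', 'QAM2-003'),
                    stringsAsFactors = FALSE)
stopifnot(identical(guide$Original, row.names(geno)))  # orders must match
row.names(geno) <- guide$Corrected
```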
# Subset only QAM samples
QAM_genotypes <- genotype_matrix[841:1320,]
# Read in the conversion table
sample_guide <- read.table('/home/mmcgowan/Dropbox/M.McGowanSharedFiles/Wheat_data/genotypeclustering/sampleconversion_guide.csv', sep = '\t', header = T)
# Replace the sample names
row.names(QAM_genotypes) <- sample_guide$Corrected
# Saving the genotype matrix (in binary format for easier reading)
saveRDS(QAM_genotypes, file = 'QAM_genotypes.RDS')